20 research outputs found

    Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

    As AI systems become an increasing part of people's everyday lives, it becomes ever more important that they understand people's ethical norms. Motivated by descriptive ethics, a field of study that focuses on people's descriptive judgments rather than theoretical prescriptions on morality, we investigate a novel, data-driven approach to machine ethics. We introduce Scruples, the first large-scale dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing a moral dilemma, paired with a distribution of judgments contributed by community members. Our dataset presents a major challenge to state-of-the-art neural language models, leaving significant room for improvement. However, when presented with simplified moral situations, the results are considerably more promising, suggesting that neural models can effectively learn simpler ethical building blocks. A key takeaway of our empirical analysis is that norms are not always clear-cut; many situations are naturally divisive. We present a new method to estimate the best possible performance on such tasks with inherently diverse label distributions, and explore likelihood functions that separate intrinsic from model uncertainty. Comment: 18 pages, 14 tables, 18 figures. Accepted to AAAI 2021. For associated code and data, see https://github.com/allenai/scruple
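
    To make the notion of a "best possible performance" under diverse label distributions concrete, here is a minimal, illustrative sketch (a generic oracle bound, not necessarily the paper's estimator): if each anecdote carries a distribution of community judgments and the evaluation label is drawn from that distribution, no single-label classifier can exceed the expected probability mass of the most popular label. The numbers below are invented.

```python
import numpy as np

# Hypothetical verdict distributions for three anecdotes (fractions of
# annotators choosing each of five classes); purely illustrative values.
label_dists = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],  # fairly clear-cut
    [0.45, 0.40, 0.05, 0.05, 0.05],  # divisive
    [0.30, 0.30, 0.20, 0.10, 0.10],  # very divisive
])

# If the gold label is sampled from each item's distribution, always predicting
# the most probable class is optimal, so the mean per-item maximum bounds the
# accuracy of any single-label model on this (made-up) data.
best_possible_accuracy = label_dists.max(axis=1).mean()
print(f"Oracle accuracy bound: {best_possible_accuracy:.3f}")  # ~0.483
```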

    UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark

    Commonsense AI has long been seen as a near-impossible goal -- until recently. Now, research interest has sharply increased with an influx of new benchmarks and models. We propose two new ways to evaluate commonsense models, emphasizing their generality on new tasks and building on diverse, recently introduced benchmarks. First, we propose a new multitask benchmark, RAINBOW, to promote research on commonsense models that generalize well over multiple tasks and datasets. Second, we propose a novel evaluation, the cost equivalent curve, that sheds new light on how the choice of source datasets, pretrained language models, and transfer learning methods impacts performance and data efficiency. We perform extensive experiments -- over 200 experiments encompassing 4800 models -- and report multiple valuable and sometimes surprising findings, e.g., that transfer almost always leads to better or equivalent performance if following a particular recipe, that QA-based commonsense datasets transfer well to each other while commonsense knowledge graphs do not, and that, perhaps counter-intuitively, larger models benefit more from transfer than smaller ones. Last but not least, we introduce a new universal commonsense reasoning model, UNICORN, that establishes new state-of-the-art performance across 8 popular commonsense benchmarks: aNLI (87.3%), CosmosQA (91.8%), HellaSWAG (93.9%), PIQA (90.1%), SocialIQa (83.2%), WinoGrande (86.6%), CycIC (94.0%), and CommonsenseQA (79.3%). Comment: 27 pages, 19 figures, 34 tables. Accepted to AAAI 2021. For associated code and data see https://github.com/allenai/rainbo
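
    The cost equivalent curve mentioned above can be sketched as follows: given learning curves (score vs. amount of target-task data) for a baseline and a transfer approach, it asks how much target data the transfer setting needs to match the baseline at each budget. This is a hedged illustration with invented curves, not the paper's procedure or numbers.

```python
import numpy as np

# Illustrative learning curves (target-task dataset size -> dev accuracy);
# the values are made up for demonstration only.
sizes           = np.array([100, 500, 1000, 5000, 10000])
baseline_scores = np.array([0.55, 0.62, 0.66, 0.72, 0.75])  # train from scratch
transfer_scores = np.array([0.63, 0.69, 0.72, 0.76, 0.78])  # with transfer

def cost_equivalent(baseline_size):
    """Transfer-setting data size matching the baseline's score at
    `baseline_size` target examples (linear interpolation on both curves;
    scores outside the transfer curve's range clip to its endpoints)."""
    target_score = np.interp(baseline_size, sizes, baseline_scores)
    # Invert the transfer curve: find the size that reaches the same score.
    return np.interp(target_score, transfer_scores, sizes)

for n in sizes:
    print(f"baseline needs {n:>5d} examples -> transfer needs ~{cost_equivalent(n):.0f}")
```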

    ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning

    We present ATOMIC, an atlas of everyday commonsense reasoning, organized through 877k textual descriptions of inferential knowledge. Compared to existing resources that center around taxonomic knowledge, ATOMIC focuses on inferential knowledge organized as typed if-then relations with variables (e.g., "if X pays Y a compliment, then Y will likely return the compliment"). We propose nine if-then relation types to distinguish causes vs. effects, agents vs. themes, voluntary vs. involuntary events, and actions vs. mental states. By generatively training on the rich inferential knowledge described in ATOMIC, we show that neural models can acquire simple commonsense capabilities and reason about previously unseen events. Experimental results demonstrate that multitask models incorporating the hierarchical structure of if-then relation types yield more accurate inference than models trained in isolation, as measured by both automatic and human evaluation. Comment: AAAI 2019 C
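
    A minimal, hypothetical encoding of ATOMIC-style typed if-then knowledge is sketched below (the relation names xIntent, oEffect, and oReact follow ATOMIC's naming scheme, but the storage format and query helper are illustrative, not the released data format):

```python
from collections import namedtuple

# One if-then triple: a base event with free variables (PersonX, PersonY),
# a typed relation, and an inferred statement.
Triple = namedtuple("Triple", ["event", "relation", "inference"])

atlas = [
    Triple("PersonX pays PersonY a compliment", "oEffect", "PersonY will likely return the compliment"),
    Triple("PersonX pays PersonY a compliment", "xIntent", "to be nice"),
    Triple("PersonX pays PersonY a compliment", "oReact",  "flattered"),
]

def infer(event, relation):
    """Return all inferences of a given relation type for an event."""
    return [t.inference for t in atlas if t.event == event and t.relation == relation]

print(infer("PersonX pays PersonY a compliment", "oEffect"))
```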

    GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

    Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human-evaluation leaderboard that brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms, asks human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency), and compares their answers to automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
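
    The comparison GENIE draws between human judgments and automatic metrics can be illustrated with a small rank-correlation check; the scores below are invented and the metric is a generic placeholder, not GENIE's actual output format.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical scores for six leaderboard submissions on one generation task:
# mean human ratings (e.g. fluency on a 1-5 scale) vs. an automatic metric.
human_fluency = np.array([3.1, 3.8, 4.2, 2.9, 4.5, 3.6])
auto_metric   = np.array([0.21, 0.30, 0.27, 0.19, 0.33, 0.24])  # e.g. a BLEU-like score

rho, pval = spearmanr(human_fluency, auto_metric)
print(f"Spearman correlation between human and automatic scores: {rho:.2f} (p={pval:.3f})")
```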

    Instrumental performance and results from testing of the BLAST-TNG receiver, submillimeter optics, and MKID arrays

    Polarized thermal emission from interstellar dust grains can be used to map magnetic fields in star-forming molecular clouds and the diffuse interstellar medium (ISM). The Balloon-borne Large Aperture Submillimeter Telescope for Polarimetry (BLASTPol) flew from Antarctica in 2010 and 2012 and produced degree-scale polarization maps of several nearby molecular clouds with arcminute resolution. The success of BLASTPol has motivated a next-generation instrument, BLAST-TNG, which will use more than 3000 linear-polarization-sensitive microwave kinetic inductance detectors (MKIDs) combined with a 2.5 m diameter carbon fiber primary mirror to make diffraction-limited observations at 250, 350, and 500 μm. With 16 times the mapping speed of BLASTPol, sub-arcminute resolution, and a longer flight time, BLAST-TNG will be able to examine nearby molecular clouds and the diffuse galactic dust polarization spectrum in unprecedented detail. The 250 μm detector array has been integrated into the new cryogenic receiver and is undergoing testing to establish the optical and polarization characteristics of the instrument. BLAST-TNG will demonstrate the effectiveness of kilo-pixel MKID arrays for applications in submillimeter astronomy. BLAST-TNG is scheduled to fly from Antarctica in December 2017 for 28 days and will be the first balloon-borne telescope to offer a quarter of the flight for "shared risk" observing by the community. Comment: Presented at SPIE Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy VIII, June 29th, 201

    Characterization, deployment, and in-flight performance of the BLAST-TNG cryogenic receiver

    The Next Generation Balloon-borne Large Aperture Submillimeter Telescope (BLAST-TNG) is a submillimeter polarimeter designed to map interstellar dust and galactic foregrounds at 250, 350, and 500 microns during a 24-day Antarctic flight. The BLAST-TNG detector arrays comprise 918, 469, and 272 MKID pixels, respectively. The pixels are formed from two orthogonally oriented, crossed, linear-polarization-sensitive MKID antennae. The arrays are cooled below 300 mK and stabilized via a closed-cycle ³He sorption fridge in combination with a ⁴He vacuum pot. The detectors are read out through a combination of the second-generation Reconfigurable Open Architecture Computing Hardware (ROACH2) and custom RF electronics designed for BLAST-TNG. The firmware and software used to read out and characterize these detectors were built from scratch by the BLAST team and have been adapted for use by other MKID instruments such as TolTEC and OLIMPO. We present an overview of these systems as well as an in-depth methodology of the ground-based characterization and the measured in-flight performance. Comment: Presented at SPIE Millimeter, Submillimeter, and Far-Infrared Detectors and Instrumentation for Astronomy X, December 13-18, 202